A method for embedding data into speech signals without recourse
to bandwidth expansion is proposed. Sampled speech is assembled 
into contiguous blocks of N samples and the Discrete Fourier Trans- 
form (DFT) is performed on each block. All the phase components in 
the message band, or the last J components in this band, are dis- 
carded when unvoiced or voiced speech is present, respectively. The 
data is introduced in the place of these rejected phase components, 
being +π/2 for a logical 0 and −π/2 for a logical 1. The magnitudes
of the coefficients associated with the data-carrying phase components
are scaled to guard against data errors resulting from channel
noise. The inverse DFT yields the transmitted sequence. The receiver 
performs the inverse process, stripping off the data and replacing it 
with random phase values. For an average transmission rate of 
approximately 1 kb/s and a channel signal-to-noise ratio of 30 dB,
the bit error rate was 5.5 × 10⁻⁴, and the average signal-to-noise
ratios for voiced and unvoiced speech were 24 and −3 dB, respectively.
However, the unvoiced sounds were perceived with negligible distor- 
tion owing to the preservation of their magnitude spectra. Modest 
error-correction codes can be used to reduce the bit error rate to 10⁻⁷
while maintaining the same recovered speech quality, provided the
average transmitted bit rate is decreased to ≈500 b/s.

I. INTRODUCTION 

Embedding data in speech signals without a significant enlargement 
of signal bandwidth has a great attraction if the data can be recovered 
without error, and the degradation of the speech is perceptually 
acceptable. There is a euphoria of getting a bargain, almost something 
for nothing. Of course it is not serendipity, but rather an exploitation 
of the innate redundancy in speech. 

A recent proposal by Steele and Vitello [1,2] for the simultaneous
transmission of speech and data signals attempted to preserve the
speech signal while accepting a bandwidth expansion of the transmit- 
ted signal. In their system the speech conveys the data using the 
principles of analog speech scrambling. The data becomes the scram- 
bling key, while the receiver acts the part of a code breaker. Every 
time the code is deciphered correctly the receiver recovers both the 
data and the speech. Codes are therefore selected that are easy to 
break. Frequency inversion scrambling was used to achieve data rates 
of 700 b/s over ideal channels, and 125 b/s when additive channel 
noise was as high as 10 dB below the mean square value of the speech 
signal. In both cases there were no data errors associated with the 
39,000 speech samples used in the experiments. 

We now propose a system for the simultaneous transmission of 
speech and data that avoids a bandwidth expansion of the transmitted 
signal compared to that of the original speech signal, but does engender 
a modest reduction in the perceptual quality of the received speech. 

II. THE SYSTEM 

The combined transmission of data and speech in our proposed 
system is achieved by discarding some phase components in the speech 
signal and replacing them with data. At the receiver the data are 
removed and replaced with random phase components. By judicious 
choice of which phase components are used for the conveyance of 
data, we are able to ensure that the recovered speech quality is only 
marginally degraded by the presence of the data. 

The speech signal, bandlimited between 200 Hz and 3.2 kHz, is
sampled at 8 kHz and divided into sequential blocks each containing
N samples. To decide whether a block of samples is to convey data,
and if so, how many bits, we perform what is tantamount to a crude
voiced, unvoiced, or silence detection. The mean square value σ_x² of the
samples in the block is computed and compared with two thresholds,
T_1 and T_2. These thresholds float relative to the mean square
value Σ of the speech calculated over many blocks, such that T_1 and
T_2 are α_1 and α_2 dB below Σ, respectively. From inspection of five
sentences we experimentally determined that α_1 = 18.5 and α_2 = 30.
The mean square value σ_x² is compared with these thresholds and the
decision to transmit B_1 or B_2 bits of data is made according to

$$\sigma_x^2 < T_2; \quad \text{no data transmitted} \tag{1}$$

$$T_2 < \sigma_x^2 < T_1; \quad B_1 \text{ bits transmitted} \tag{2}$$

$$T_1 < \sigma_x^2; \quad B_2 \text{ bits transmitted,} \tag{3}$$

where

$$B_2 < B_1. \tag{4}$$

Although our decision as to whether to transmit data, and if so,
whether B_1 or B_2 bits will be embedded in the speech signal, depends
only on Inequalities (1) to (3), we may consider that to a good
approximation these inequalities refer to the presence of silence,
unvoiced speech, or voiced speech, respectively. Observe that as a
consequence of Inequalities (1) to (3) the bit rate is variable, being
dependent on the presence and nature of the speech signal. As the
system is conceived for embedding data into speech signals, we envisage
conventional modem techniques being deployed for the transmission
of data during prolonged silences [3,4], assuming a time assignment
speech interpolation (TASI)-type arrangement is not in service.
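
The three-way decision of Inequalities (1) to (3) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the thresholds are taken as power ratios (10 log10) below the long-term mean square value, and a block that lands exactly on a threshold is assigned to the higher class, which the text does not specify.

```python
def mean_square(block):
    """Mean square value of one block of samples."""
    return sum(s * s for s in block) / len(block)


def classify_block(block, long_term_ms, alpha1_db=18.5, alpha2_db=30.0):
    """Return 'none', 'B1' or 'B2' following Inequalities (1) to (3).

    long_term_ms is the mean square value measured over many blocks.
    The dB offsets are assumed to be power ratios, i.e. 10*log10.
    """
    t1 = long_term_ms * 10 ** (-alpha1_db / 10)  # threshold T_1
    t2 = long_term_ms * 10 ** (-alpha2_db / 10)  # threshold T_2 (lower)
    ms = mean_square(block)
    if ms < t2:
        return "none"  # Inequality (1): treated as silence, no data
    if ms < t1:
        return "B1"    # Inequality (2): treated as unvoiced, B_1 bits
    return "B2"        # Inequality (3): treated as voiced, B_2 bits
```

With a long-term mean square value of 1.0, a block of amplitude 0.1 (20 dB down) falls between the two thresholds and is classed as carrying B_1 bits.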

Provided Inequality (2) or (3) is satisfied, the Discrete Fourier
Transform (DFT) is performed on the block of speech samples
{x(n)}, n = 0, 1, ..., N − 1, namely,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N}; \quad k = 0, 1, \ldots, N-1 \tag{5}$$

or

$$X(k) = \mathrm{Re}(k) + j\,\mathrm{Im}(k), \tag{6}$$

where Re(k) and Im(k) are the real and imaginary components of X(k),
respectively. The magnitude of X(k) is

$$|X(k)| = \sqrt{\mathrm{Re}^2(k) + \mathrm{Im}^2(k)} \tag{7}$$

and its phase angle is

$$\phi(k) = \tan^{-1}\left[\frac{\mathrm{Im}(k)}{\mathrm{Re}(k)}\right]. \tag{8}$$

The procedure for assigning data depends on whether Inequality (2)
or (3) occurs.
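
As a small illustration of the magnitude and phase decomposition, the sketch below evaluates the DFT directly (an O(N²) sum rather than an FFT) and splits each coefficient into its magnitude and phase angle. The function names are hypothetical, and the negative exponent sign is the usual forward-transform convention, assumed here.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of the DFT of a block of samples."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def magnitude_and_phase(spectrum):
    """Split each X(k) into |X(k)| and its phase angle in (-pi, pi]."""
    mags = [abs(c) for c in spectrum]
    # cmath.phase is a four-quadrant arctangent of Im(k) and Re(k).
    phases = [cmath.phase(c) for c in spectrum]
    return mags, phases
```

For one cycle of a sine wave over eight samples, the coefficient at k = 1 has magnitude 4 and phase −π/2.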

2.1 Unvoiced speech 

If Inequality (2) is satisfied, the speech is almost certainly unvoiced. 
When unvoiced speech occurs the vocal cords do not vibrate, and the 
sounds originate because of turbulent air flow at a constriction in the 
vocal tract. Unvoiced sound has a noise-like nature and tends to have 
low energy. The former characteristic is valuable when data, transmit- 
ted as the phase components in the unvoiced speech signal, are 
removed at the receiver and replaced with random phase components. 
The re-introduced phase components have a similar randomness to 
the original components, and the perceptual quality of the sound is 
negligibly degraded. The low energy of unvoiced speech is, by contrast, 
an undesirable feature when data is embedded in the phase compo- 
nents, as channel noise may precipitate large variations in the phase 
of the received signal causing a high bit error rate (BER). Consequently,
steps must be taken to increase the energy of the unvoiced
sounds. 

2.1.1 μ-law spectral scaling

The effect of channel noise can be mitigated by scaling the magnitude
of the spectral components according to the μ-law [5], producing
magnitude components

$$|D(k)| = V\, \frac{\log\left(1 + \mu_{uv} |X(k)| / V\right)}{\log(1 + \mu_{uv})}; \quad |X(k)| \le V, \tag{9}$$

where μ_uv is the compression factor for unvoiced speech and V is the
μ-law overload parameter. The factor μ_uv is selected to provide an
acceptably low BER, and also to contain the amplitude range of the
transmitted signal. The experimental determination of μ_uv is discussed
in Section IV. The components |D(k)| are calculated for k spanning
the voice bandwidth, i.e., k_c1 to k_c2, where k_c1 and k_c2 are the spectral
components associated with 200 and 3200 Hz, respectively.
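
A minimal sketch of the μ-law magnitude scaling of eq. (9), together with the inverse operation the receiver would need, is given below. It assumes the standard μ-law compression curve; the function names and the choice V = 1 in the usage note are illustrative only.

```python
import math


def mu_law_scale(mag, mu, overload):
    """Compress a magnitude |X(k)| <= V into |D(k)| with the mu-law."""
    return overload * math.log(1 + mu * mag / overload) / math.log(1 + mu)


def mu_law_descale(scaled, mu, overload):
    """Invert mu_law_scale, recovering |X(k)| from |D(k)|."""
    return overload * ((1 + mu) ** (scaled / overload) - 1) / mu
```

With μ = 750 and V = 1, a small magnitude of 0.01 is raised to roughly 0.32 while a magnitude equal to V is left unchanged, which is how low-energy components gain protection against channel noise without the peak level growing.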

2.1.2 Data insertion 

Having described the scaling of the magnitude of the frequency
components, we now consider how the data of B_1 bits are embedded in
the phase spectrum. All the unvoiced phase components over the
speech bandwidth are discarded and replaced with binary phase components
determined by the data. As the phase angle φ(k) is confined
to ±π radians, we arrange for phase components carrying data to be
designated θ(k) and have values

$$\theta(k) = \begin{cases} \pi/2, & \text{signifying logical 0} \\ -\pi/2, & \text{signifying logical 1} \end{cases} \tag{10}$$

for k = k_c1 to k_c2. Although multi-level θ(k) does increase the amount
of data embedded in the speech blocks, we opted for binary θ(k) to
make the system more robust to channel impairments. Unless otherwise
stated we will assume that every phase component contains a
data bit, whence

$$B_1 = k_{c2} - k_{c1} + 1. \tag{11}$$

However, in the presence of channel impairments we may allocate
each bit to an odd number of phase components, and decide on the
logical value of the bit at the receiver by a simple majority vote of the
logical values associated with the received phase components. More
complex channel coding techniques can be employed to further reduce
BER.
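
The mapping of bits onto binary phases, with the optional repetition of each bit over an odd number of components, might look as follows. The helper names and the list-based representation of the phase spectrum are assumptions made for this sketch.

```python
import math


def bits_to_phases(bits, repeat=1):
    """Map bits to +pi/2 (logical 0) or -pi/2 (logical 1).

    Each bit is repeated over `repeat` phase components; repeat must be
    odd so that a majority vote at the receiver cannot tie.
    """
    if repeat % 2 == 0:
        raise ValueError("repeat must be odd")
    phases = []
    for bit in bits:
        theta = math.pi / 2 if bit == 0 else -math.pi / 2
        phases.extend([theta] * repeat)
    return phases


def embed_phases(phases_in, k_c1, k_c2, data_phases):
    """Replace the phases for k = k_c1 .. k_c2 (inclusive) by data phases."""
    if len(data_phases) != k_c2 - k_c1 + 1:
        raise ValueError("need exactly k_c2 - k_c1 + 1 data phases")
    out = list(phases_in)
    out[k_c1:k_c2 + 1] = data_phases
    return out
```

With `repeat=1` every component in the band carries one bit; with `repeat=3` the same band carries a third as many bits, each protected by a three-way vote.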

2.1.3 The combined data and unvoiced speech sequence 

The combined speech and data signal is obtained by performing the 
inverse discrete Fourier transform (IDFT) on the magnitude and phase 
spectral components. The original spectral coefficients are 

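$$X(k) = |X(k)|\, e^{j\phi(k)} \tag{12}$$

and those that carry data are

$$D(k) = |D(k)|\, e^{j\theta(k)}, \tag{13}$$

where |D(k)| is given by eq. (9). The combined data and speech
sequence is

$$g_i = \frac{1}{N}\left[ \sum_{k=0}^{k_{c1}-1} X(k)\, e^{j 2\pi ik/N} + \sum_{k=k_{c1}}^{k_{c2}} D(k)\, e^{j 2\pi ik/N} + \sum_{k=k_{c2}+1}^{N-2-k_{c2}} X(k)\, e^{j 2\pi ik/N} + \sum_{k=N-1-k_{c2}}^{N-1-k_{c1}} D(k)\, e^{j 2\pi ik/N} + \sum_{k=N-k_{c1}}^{N-1} X(k)\, e^{j 2\pi ik/N} \right]. \tag{14}$$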



Some values of θ(k) are used to inform the receiver whether Inequality
(2) or (3) applies. This side information only constitutes a
minor part of the transmitted data. The receiver is able to determine 
if Inequality (1) is valid by examining the mean square value of the 
received signal blocks. 
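
The construction of the combined sequence can be sketched as below: the data-carrying coefficients D(k) overwrite a band of the spectrum, their mirror-image bins receive the complex conjugates, and an inverse DFT returns a real time sequence. The mirror bin is taken as N − k, the usual Hermitian-symmetry convention for a DFT indexed from 0; that indexing, the 1/N normalization, and the function names are assumptions of this sketch.

```python
import cmath
import math


def idft(spectrum):
    """Direct inverse DFT with 1/N normalization."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * i * k / n)
            for k in range(n)) / n
        for i in range(n)
    ]


def combine(spectrum, data_coeffs, k_c1):
    """Insert D(k) starting at bin k_c1 and return the time sequence.

    Mirror bins N - k are set to the conjugate of D(k) so that the
    resulting sequence is real (up to rounding). All data bins are
    assumed to lie strictly between 0 and N/2.
    """
    n = len(spectrum)
    out = list(spectrum)
    for offset, d in enumerate(data_coeffs):
        k = k_c1 + offset
        out[k] = d
        out[n - k] = d.conjugate()
    return idft(out)
```

Transforming the returned sequence again yields the same magnitudes and data phases in the band, which is what the receiver relies on.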

2.2 Voiced speech 

In voiced speech the vibration of the vocal cords causes broad 
spectrum puffs of air to excite the vocal tract, and the short-time 
Fourier spectrum of the speech has a quasi-periodicity, and an energy 
level considerably in excess of that encountered with unvoiced speech. 
Consequently, if data is loaded onto too many phase components, the 
quasi-periodicity of the recovered voiced speech will be disturbed and 
the speech quality degraded. We therefore discard only J phase
components,

$$J < k_{c2} - k_{c1}, \tag{15}$$

whenever Inequality (3) is satisfied, and replace them with B_2 bits.
Typically, J is 0.16 to 0.33 of k_c2 − k_c1. Of course, Inequality (3) may
sometimes occur when unvoiced speech is present, but only J phase
components will be used for the conveyance of data. Although the
occurrence of Inequality (3) signifies that fewer phase components are
available for data transmission, this state is important as voiced speech
is approximately four times more prevalent than unvoiced speech. The
maximum value of B_2 is J, and this will be assumed to occur unless
otherwise stated. However, we note that if we assign error protection
coding to the binary phase components carrying data, B_2 decreases,
and so does the BER compared to the situation when B_2 is equal to J.

Spectral scaling using μ-law is also necessary for voiced speech to
ensure that for the expected channel noise the BER is negligible.
Equation (9) is applicable, where |D(k)| is determined over the range
of J coefficients. The combined voiced speech and data sequence
conforms to eq. (14) with the exception that D(k) extends over the
range of J components.

The block diagram for embedding data into the speech signal is
displayed in Fig. 1. The speech signal sampled at f_s is directed into
either shift register SR1 or SR2, with switches S1 and S2 changing their
positions every N/f_s seconds. While the speech samples are being
extracted from SR1, say, the computation of the mean square value
σ_x² of the current N speech samples entering SR2 is in progress. After
N/f_s seconds σ_x² is determined and compared with parameters T_1 and
T_2. If Inequality (1) prevails the outputs of comparators COMP.1 and
COMP.2 are logical 0 and logical 1, respectively, and consequently the
NOR gate is in the logical 0 state. Both switches S3 and S4 receive
logical 0 signals and remain open, preventing data from being placed
into the data store. When switches S1 and S2 change, the logical 1 state
of COMP.2 is also used to inhibit the speech samples from entering
the Fast Fourier Transform (FFT) circuit, and instead routes the
contents of SR2 directly to the output, bypassing the data-embedding
system. This latter arrangement is not shown in Fig. 1. Should Inequality
(3) occur, COMP.1, COMP.2, and the NOR gate occupy
logical 1, 0, and 0 states, respectively. Switch S3 closes, and (B_2 − ε) bits
are passed into the data store, where ε is employed to inform the
receiver that Inequality (3) applies. If both COMP.1 and COMP.2 are
in the logical 0 state, the NOR gate becomes a logical 1, closing switch
S4. Data of (B_1 − ε) bits proceed via switch S4 into the data store, where
this time ε signifies the presence of Inequality (2). Thus, ε need be
only one bit, unless protection coding is added. If speech is deemed to
be present, the speech in SR2 is applied to the FFT device, and the
magnitude |X(k)| and phase φ(k) components of the block of speech
samples generated. The |X(k)| and φ(k) components are passed via
switches S5 and S7 either directly to the Inverse Fast Fourier Transform
(IFFT) via switches S6 and S8, or are subjected to spectral scaling
and data insertion according to the number B_2 or B_1 bits removed from
the data store. If voiced speech occurs only J components of φ(k) are
used, but when B_1 is present (k_c2 − k_c1) components of φ(k) are
converted to θ(k). Observe that the spectral component(s) for ε is
always located in the same location in the J spectral region. The
sequences at the output of switches S6 and S8 are applied to the IFFT
to yield the combined speech and data sequence {g_i}.
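
The comparator and NOR-gate states described above can be checked with a short truth-table sketch. The assignment of each comparator to a threshold (COMP.1 fires above T_1, COMP.2 fires below T_2) is inferred from the states the text lists for each inequality, and the dictionary keys are hypothetical names.

```python
def control_signals(ms, t1, t2):
    """Logic states for one block with mean square value ms.

    Assumed wiring: COMP.1 = 1 when ms > T_1, COMP.2 = 1 when ms < T_2,
    S3 driven by COMP.1 and S4 driven by the NOR gate.
    """
    comp1 = int(ms > t1)
    comp2 = int(ms < t2)
    nor = int(not (comp1 or comp2))
    return {
        "COMP1": comp1,
        "COMP2": comp2,
        "NOR": nor,
        "S3_closed": bool(comp1),  # voiced: B_2 bits enter the data store
        "S4_closed": bool(nor),    # unvoiced: B_1 bits enter the data store
    }
```

Only one of S3 and S4 can be closed for any block, and both stay open when the block is classed as silence.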

III. THE RECEIVER 

We will refrain from discussing the numerous methods by which the 
combined speech and data signal can be transmitted, nor will we 
address the variety of channels, their attendant equalization, or the 
techniques of correctly locking the receiver clock and the attendant 
acquisition of sample and block synchronization. Rather, we will 
assume that the combined signal is correctly sampled and ordered into 
the correct blocks. 

The receiver's first task is to remove the data, but before that the 
receiver must ascertain if data have been transmitted. This is relatively 
straightforward since if data are embedded in the speech block, spectral 
scaling of the coefficients will have been performed at the transmitter, 
and the mean square value of the combined signal is significantly 
greater than T_2 of Inequality (1). If no data is considered to be present,
the received signal is accepted as the received speech signal. When
data is deemed to be present, we need to determine whether it is
located in J or k_c2 − k_c1 phase components. Accordingly, the FFT is
taken of the combined data and speech sequence, and the spectral
phase component(s) associated with the ε bit(s) examined so the
receiver can determine if Inequality (2) or (3) applies. Once this is
accomplished the data are extracted from the received phase components
θ(k) that are known to contain binary information, according to

$$0 < \theta(k) < \pi, \quad \text{logical 0 generated}$$

and

$$-\pi < \theta(k) < 0, \quad \text{logical 1 generated.} \tag{16}$$
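
The decision rule of eq. (16) amounts to reading the sign of each received phase. A minimal sketch, with an optional majority vote for the case where each bit was spread over an odd number of components, is shown below; the function name is hypothetical, and a phase of exactly zero is arbitrarily decoded as logical 1.

```python
def phases_to_bits(received_phases, repeat=1):
    """Decode bits from received phases, then vote over groups.

    A positive phase gives logical 0 and a negative phase logical 1.
    `repeat` is the (odd) number of phase components used per bit.
    """
    hard = [0 if theta > 0 else 1 for theta in received_phases]
    bits = []
    for start in range(0, len(hard) - repeat + 1, repeat):
        ones = sum(hard[start:start + repeat])
        bits.append(1 if 2 * ones > repeat else 0)
    return bits
```

Because only the sign matters, a received phase can drift by almost π/2 in either direction from its nominal ±π/2 value before a bit error occurs.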

Having removed the data we proceed to recover the speech signal. 
The missing phase components are replaced by phase components 
having any value between ±π with equal probability. Those coefficients
whose magnitudes were scaled are then de-scaled by the inverse μ-law
operation. The IFFT follows, and the speech sequence {x̂_i} so formed
contains distortion, which is most serious near the ends of the blocks.
A simplified explanation of this distortion is as follows. Consider
successive DFT spectra to contain one line, with the phase of this line
changing every block by π/2, while the amplitude of the spectral lines
remains constant. The time waveform is a sinusoid whose phase
changes by π/2 at the ends of the blocks. Now consider many spectral
components whose phase changes by a random value between blocks. 
We may again consider these components to be transformed into the 
time domain as sinusoids with abrupt phase changes at the block 
boundaries. In the case of speech, each spectral component has a 
different magnitude, and J or all the components may have their 
phases randomized. The end of block distortion ensues, its values 
varying from one block boundary to another in a manner difficult to 
quantify. 

To mitigate end of block distortion we apply median filtering [6,7] to
those samples at the ends of adjacent blocks. Thus, samples between
the mth and (m + 1)th blocks are median filtered to give

$$\tilde{x}_{mN+j+i} = \mathrm{MED}\left[\hat{x}_{mN+j}, \hat{x}_{mN+j+1}, \ldots, \hat{x}_{mN+j+i}, \ldots, \hat{x}_{mN+j+2i}\right], \quad j = -M, \ldots, M-(2i-1), \tag{17}$$

where x̂ and x̃ denote samples before and after filtering, and i is a
constant for a particular filter whose length is

$$L = 2i + 1, \quad i = 1, 2, 3, \ldots, \tag{18}$$

i.e., the median value of L samples is the filtered sample. The number
of samples median filtered in the vicinity of the block boundaries is

$$\lambda = 2(M + 1 - i) \tag{19}$$

and the number of samples used in the filtering of λ samples is

$$\gamma = 2(M + 1), \tag{20}$$

where M is a system parameter.

As an illustration of how the median filter equations are used,
consider the example of a three-point median filter, L = 3, and M = 5.
Equation (17) becomes for these parameters

$$\tilde{x}_{mN+j+1} = \mathrm{MED}\left[\hat{x}_{mN+j}, \hat{x}_{mN+j+1}, \hat{x}_{mN+j+2}\right], \quad j = -5, \ldots, 4, \tag{21}$$

and x̃_{mN+j+1} is the median value of x̂_{mN+j}, x̂_{mN+j+1}, and x̂_{mN+j+2}. The
number of samples used in the filtering process is γ = 12, which will be
made up of six samples at the end of the mth block and six samples at
the commencement of the (m + 1)th block. There are λ = 10 samples
median filtered commencing with x̃_{mN−4} when j = −5, and terminating
with x̃_{mN+5} when j = 4. Thus, the number of terms L in the brackets of
eq. (17) gives the number of samples used in the filtering process of
each sample. As j steps from −M to M − (2i − 1), the sample being
filtered, namely x̂_{mN+j+i}, also changes under the control of j.

After λ samples have been filtered, the recovered speech sequence
is obtained, having these λ samples, and N − λ components from {x̂_i}
for each block of N samples. The median filtering significantly reduces
the end of block distortion.
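
The boundary median filter can be sketched as follows. The sketch assumes the filter is non-recursive, i.e., every median is taken over unfiltered samples, and it uses a 0-based index `last_of_block` for the final sample of the mth block (the role played by index mN in the text). The function names are hypothetical.

```python
from statistics import median


def filter_counts(m, i):
    """Return (number of samples filtered, number of samples used)."""
    return 2 * (m + 1 - i), 2 * (m + 1)


def filter_boundary(samples, last_of_block, m=5, i=1):
    """Median filter the samples around one block boundary.

    Windows of length 2*i + 1 are slid from offset -m to m - (2*i - 1)
    relative to last_of_block; each median replaces the window centre.
    The caller must ensure the windows stay inside `samples`.
    """
    out = list(samples)
    for j in range(-m, m - (2 * i - 1) + 1):
        start = last_of_block + j
        window = samples[start:start + 2 * i + 1]
        out[start + i] = median(window)
    return out
```

For the three-point filter with M = 5, `filter_counts(5, 1)` gives 10 filtered samples drawn from 12 input samples, matching the worked example, and an isolated spike next to the boundary is removed while samples outside the filtered region are untouched.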

IV. RESULTS

The sentences, "Glue the sheet to the dark blue background," "Rice 
is often served in round bowls," "Four hours of steady work faced us," 
and "The box was thrown beside the parked truck," were used in our 
experiments. The first two sentences were spoken by females, the 
remainder by males. These concatenated sentences constituting our
speech signal were bandlimited between 200 Hz and 3200 Hz and
sampled at 8 kHz to provide the input speech sequence, {x_i}. Random
binary data were introduced into the phase components of {x_i} in the
manner described in Section II. The information ε was assumed to be
received without error. This is a reasonable assumption because ε can
be specified by one bit and as the error rate of the phase components
carrying data will be shown to be 0.055 percent, the probability of ε
being in error can be rendered negligible by assigning a small number
of error-correcting bits to ε. The block size N was 256.

Because of the spectral scaling, the first experiments related to the
increases in the peak and rms values of the combined speech and data
sequence, {g_i}, compared to those of the input speech sequence, {x_i}.
We were concerned that the spectral scaling might significantly increase
the amplitude and power levels of the original speech signal and
overload the communication channel. To observe the effect of spectral
scaling on the amplitude components in {g_i} we proceeded as follows.
Each block of speech was examined, and those blocks where Inequality 
(3) applied, our so-called voiced blocks, were noted. Using these blocks 
we calculated two signal expansion parameters, which we defined as, 

$$r_v \triangleq \frac{1}{\psi_v} \sum_{i=1}^{\psi_v} \frac{|g|_{\max,v,i}}{|x|_{\max,v,i}} \tag{22}$$

and

$$\rho_v \triangleq \frac{1}{\psi_v} \sum_{i=1}^{\psi_v} \frac{\sigma_{g,v,i}}{\sigma_{x,v,i}}, \tag{23}$$

where |g|_{max,v,i} and |x|_{max,v,i}, and σ_{g,v,i} and σ_{x,v,i} are the maximum and
rms values of the combined speech and data sequence, and the input
speech sequence, in the ith blocks, respectively. The number of voiced
blocks is ψ_v, where the subscript v is used to signify the applicability
of Inequality (3). Figures 2 and 3 show the variation of r_v and ρ_v as a
function of the spectral scaling factor, μ_v, for voiced speech. As more
phase components are used to convey data, i.e., increasing J, more
spectral magnitude components are increased by μ-law spectral scaling;
and on performing the IFFT the rms and maximum amplitudes of the
voiced blocks in the transmitted signal are increased, and hence r_v and
ρ_v are increased for a given μ_v. Similarly, for a given J the effect of
increasing μ_v is to increase the spectral scaling of the magnitude
components, which consequently increases r_v and ρ_v.

Fig. 2. Variation of the average ratio of maximum amplitudes of the transmitted
signal to the input speech signal, r_v, as a function of the μ-law scaling factor, μ_v, for
voiced speech for different values of J.

Fig. 3. Variation of the average ratio of the rms value of the transmitted signal to
the input speech signal, ρ_v, as a function of the μ-law scaling factor μ_v for voiced speech
for different values of J.

By selecting only those blocks where Inequality (2) applied, we
found the signal expansion factors r_uv and ρ_uv using the same procedures
as employed for r_v and ρ_v [see eqs. (22) and (23)]. The subscript
uv, an abbreviation for unvoiced speech, implies the validity of Inequality
(2). The variation of r_uv and ρ_uv as a function of the spectral
scaling factor, μ_uv, for unvoiced speech is displayed in Fig. 4. Unlike
the small increases in r_v and ρ_v that occur for voiced speech, the effect
of spectral scaling all the magnitude components by large values of
μ_uv results in substantial increases in r_uv and ρ_uv. However, unvoiced
speech has much lower magnitude and rms values than voiced speech,
enabling much larger values of r_uv and ρ_uv to be used, thereby protecting
the data from channel noise. If it is required that the peak or rms value
of {g_i} is not to exceed that of {x_i}, an attenuator must be placed
after the IFFT in Fig. 1. The effect of such an attenuator on the
recovered signal-to-noise ratio (s/n) and BER will be discussed later.

Fig. 4. Variation of r_uv and ρ_uv as a function of μ_uv.

An objective criterion for the quality of the recovered speech signal 
should take cognizance of the particular process being used to convey 
data, yet be sufficiently well known to have comparative value. In this 
system the distortion in the recovered speech signal originates from 
two main processes, namely, the randomizing of the phase components 
of voiced and unvoiced speech, and the effect of additive channel noise. 
The randomization of the phase components does not alter the recovered
magnitude spectra, and thus spectral distortion measures [8]
based on spectral power are inappropriate. The only errors in the
magnitude spectra derive from the channel noise. Signal-to-noise ratio
measurements are familiar to engineers in spite of their shortcomings,
and the two most widely quoted are the average s/n and the segmental
s/n [9]. In the former the ratio of the average signal power to the average
error power is found. In determining segmental s/n the signal is divided
into segments or blocks, and the average signal to average error power
is computed in decibels for each segment. Then s/n values of each
segment are averaged to give segmental s/n. We decided to use
segmental s/n, and proceeded to divide the speech into voiced or 
unvoiced segments. We made this division because the effect of ran- 
domizing the phase yields s/n values for a segment that critically 
depend on whether the segment contains voiced or unvoiced speech. 
A low s/n for unvoiced speech can be anticipated, as randomizing 
every phase component yields a time waveform that is radically 
different from the original segment of speech. The s/n for that block 
of speech is accordingly very low, and is often negative. However, 
because the magnitude of the spectra for the recovered and original 
speech signals are the same, these signals are perceived to be similar. 
In the case of voiced speech the effect of randomizing J spectral 
components without altering their magnitudes results in end of block 
distortion. This distortion is mitigated by employing median filtering 
as previously described. The end of block distortion is not significant 
with unvoiced speech because of the relatively small magnitudes of 
the spectral coefficients. By measuring the s/n of each voiced segment, 
we provide a measure of the end of block distortion. Thus, segmental 
s/n is a reasonable measure for voiced speech, and a poor measure for 
unvoiced speech, in that the value of the segmental s/n has a close 
correspondence with the perceived speech in the case of voiced speech, 
and vice versa for unvoiced speech. We note in passing that in waveform
encoding, like the situation here, the segmental s/n is usually
high for voiced speech and low or negative for unvoiced speech [10].
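
The difference between the average s/n and the segmental s/n can be made concrete with a short sketch. The function names are hypothetical, and the sketch assumes every segment has a nonzero error power (otherwise the ratio is undefined).

```python
import math


def snr_db(original, recovered):
    """Average s/n: total signal power over total error power, in dB."""
    signal = sum(s * s for s in original)
    error = sum((s - r) ** 2 for s, r in zip(original, recovered))
    return 10 * math.log10(signal / error)


def segmental_snr_db(original, recovered, block_len):
    """Segmental s/n: the mean of the per-block s/n values in dB."""
    values = [
        snr_db(original[s:s + block_len], recovered[s:s + block_len])
        for s in range(0, len(original) - block_len + 1, block_len)
    ]
    return sum(values) / len(values)
```

For a loud block with 20 dB s/n followed by a quiet block with 0 dB s/n, the segmental figure is 10 dB whereas the average s/n is about 17 dB, since the average is dominated by the high-energy block. This is why low-energy unvoiced blocks pull the segmental measure down so strongly.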

In our experiments we proceeded as follows. Assuming the channel 
to be ideal we determined the s/n of the recovered speech signal as a 
function of the number J of phase components discarded for voiced 
speech. Only blocks where Inequality (3) applied were used in the s/n 
calculation. We performed experiments for J measured over different 
coefficient ranges, e.g., from k_c1 to higher values of k, about the center
of the coefficient range, and from k_c2 to lower values of k. The location
of the range of J caused different perceptual impairments in the
recovered speech signal. From informal listening tests we concluded
that the latter range for J was preferable, and, accordingly, we display
in Fig. 5 the s/n for voiced speech as a function of J measured from k_c2
to lower values of k. In determining the s/n, we employed the median
filter having a length L of 3, and M = 5. As the curve in Fig. 5 was
obtained for an ideal channel, it is independent of the values of μ_v and
μ_uv, factors introduced to avoid data errors in the presence of channel
impairments. The exchange of s/n in decibels with J is given by

$$\mathrm{s/n} \simeq 39 - 0.375J; \quad 8 < J < 40, \tag{24}$$

i.e., a loss of 0.375 dB in s/n per phase component of voiced speech
employed for the transmission of data.

Fig. 5. Variation of s/n versus J for voiced speech. The s/n of unvoiced speech is
also shown.
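
As a quick numerical reading of eq. (24), the sketch below evaluates the linear fit for a few values of J. The function name is hypothetical, and the fit is only applied inside the stated range for an ideal channel.

```python
def voiced_snr_db(j):
    """Approximate voiced-speech s/n in dB for an ideal channel."""
    if not 8 < j < 40:
        raise ValueError("the linear fit is quoted only for 8 < J < 40")
    return 39.0 - 0.375 * j
```

Under this fit, J = 16, 24, and 32 correspond to about 33, 30, and 27 dB, so each additional eight data-carrying phase components cost 3 dB of voiced s/n.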

In the case of blocks containing unvoiced speech 98 phase components
in each block are used for the conveyance of data (corresponding
to 200 to 3200 Hz, N = 256), and the segmental s/n for these blocks is
only −2 dB. As mentioned above, this low s/n for unvoiced speech was
expected owing to the randomization of all the phase components in 
the recovered speech. However, the perceptual quality and intelligibil- 
ity of the recovered unvoiced speech is good as the magnitude spectra 
are maintained, and the excitation in unvoiced speech is noise-like. 
The s/n for unvoiced speech is shown in Fig. 5 as a horizontal line. 

For the sentences used, transmitted data rates of 1200, 1024, and 
852 b/s were achieved for J = 32, 24, and 16, respectively. 

To prevent the channel from being overloaded by excessive amplitude
levels resulting from signal amplification owing to spectral component
scaling, attenuation of {g_i} was performed. The attenuation
was adjusted until the range of amplitude levels of the combined
speech and data signal was the same as that of the input speech signal.
Specifically, we found the block with the largest output amplitude
whose magnitude expansion parameter was r_A, say. The attenuation in
decibels was then set at

$$\mathcal{J} = 20 \log_{10}(r_A)$$

for the whole speech signal. Channel noise {n_i} was next added to the
transmitted signal, and for a constant channel noise power of minus P



2960 THE BELL SYSTEM TECHNICAL JOURNAL, DECEMBER 1982 



dB below the mean square value of the input speech signal, the change
in s/n relative to the s/n in the absence of spectral scaling was found
as a function of μ_v for blocks where Inequality (3) applied. This was
repeated for different values of P and two values of J to yield the
curves shown in Fig. 6a and b. As we expected, when P becomes
progressively more negative, the change in s/n, namely Δs/n, approaches
zero for all μ_v. When the additive noise power P is high and
J = 32, there is a loss in s/n that increases with μ_v, but never exceeds
3 dB for the parameters shown in Fig. 6a. For J = 24, the loss in s/n
is much smaller (≈1 dB), and for μ_v below 100, Δs/n may be slightly
positive. This small positive value of Δs/n arises because for low values
of μ_v, the spectral scaling of the J coefficients carrying data is insufficient
to cause the combined data and speech sequence {g_i} to be
attenuated. Thus, for a given channel noise power P the channel s/n
decreases with the result that the recovered s/n is marginally enhanced.
When μ_v exceeds 100, and the attenuation of {g_i} is as
described above, the channel s/n decreases, and Δs/n takes on negative
values. When the experiment was repeated with blocks where Inequality
(2) applied, Δs/n was always positive as shown in Fig. 6c.
Observe that for unvoiced speech no attenuation of the combined
speech and data signal need be imposed for μ_uv < 400, as the signal
does not exceed the levels found in voiced speech. Consequently, Δs/n
is nearly constant until μ_uv > 400, whence attenuation of {g_i} is
employed. Δs/n decreases slowly with μ_uv, and Δs/n is marginally
greater for P of −20 dB than −30 dB, i.e., −20 dB of channel noise is
advantageous. However, the variation of Δs/n in Fig. 6 is not great,
being positive for unvoiced speech and negative (in general) for voiced
speech.

With the attenuator adjusted as previously described such that the 
amplitude range of the transmitted signal and the original speech 
signal are the same, we observed an improvement in BER, defined as 

$$ I_{\mathrm{BER}} = 20 \log_{10}\!\left(\frac{\mathrm{BER}_{0}}{\mathrm{BER}_{\mu}}\right), \tag{25} $$

where BER_μ and BER_0 represent the BER when spectral scaling of
value μ is used and when no spectral scaling is employed, respectively.
The variation of I_BER with μ for different values of noise power P is
displayed in Fig. 7. The smallest value of P employed was -30 dB, as
we did not have sufficient data for reliable results when P was more
negative. We see from Figs. 7a and b that μ_v of the order of 250 is a
good choice, as it provides a large value of I_BER while avoiding
significant losses in s/n, as displayed in Figs. 6a and b. Thus, for
J = 24 and P = -30 dB, μ_v = 250 provides a gain in BER of ≈50 dB
while sustaining a loss in recovered speech s/n of only 0.5 dB. As is
expected, larger values of μ_uv apply, as shown in Fig. 7c, and a good
choice of μ_uv is 750. Using this μ_uv value and P = -30 dB, we
achieved I_BER of ≈50 dB and a gain in s/n of 0.5 dB. Figures 6 and 7
highlight the desirable properties of spectral scaling: a large
improvement in BER and, at worst, a small loss in speech s/n.
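
As a minimal sketch of the improvement measure of eq. (25), the value can be computed from two measured error rates. The ratio is taken as the unscaled BER over the scaled BER, so that a reduction in BER gives a positive improvement, and a base-10 logarithm is assumed, as is usual for decibel quantities. The function name and the sample error rates are illustrative assumptions, not measurements from the experiments.

```python
import math


def ber_improvement_db(ber_unscaled: float, ber_scaled: float) -> float:
    """Improvement in BER, in dB, in the sense of eq. (25).

    ber_unscaled: BER measured with no spectral scaling.
    ber_scaled:   BER measured with spectral scaling of value mu.
    """
    if ber_unscaled <= 0 or ber_scaled <= 0:
        raise ValueError("both error rates must be positive")
    return 20.0 * math.log10(ber_unscaled / ber_scaled)


# Hypothetical error rates, chosen only to show the scale of the measure:
# a hundredfold reduction in BER corresponds to 40 dB.
print(round(ber_improvement_db(1e-1, 1e-3), 1))  # 40.0
```

On this scale, the improvement of about 50 dB quoted in the text corresponds to a BER ratio of roughly 10^2.5, or a little over 300.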
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The channel s/n was computed as 



$$ \mathrm{s/n}_c = 10 \log_{10}\!\left[ \frac{\sum_{i=1}^{W} \hat{g}_i^{2}}{\sum_{i=1}^{W} n_i^{2}} \right], \tag{26} $$

where ĝ_i is the attenuated version of g_i, n_i is the i-th channel
noise sample, and W is the number of speech samples in the input
speech signal. Although the same noise source was used as in the
previous experiments, and the attenuator was employed, s/n_c differs
from the s/n of P dB computed using the input speech and the noise
signal. This difference arises because {ĝ_i} is not identical to
{x_i}. The sequence {g_i} depends on μ_v and μ_uv, parameters
which affect both r and p [see eqs. (22) and (23)]. However, the s/n
differences are small, and typically are <3 dB. Figure 8 displays the
variation of BER as a function of s/n_c for different μ_uv, and
μ_v = 250. As we anticipated from Fig. 7c, increasing μ_uv from 500 to
1000 results in an increase in I_BER, which means a decrease in BER.
Further increases in μ_uv will decrease BER, but the reduction will
not be great.

Fig. 8. Variation of BER as a function of s/n_c for different values
of μ_uv; μ_v = 250, J = 24.
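
The channel s/n of eq. (26) is a ratio of summed squares over the W samples, expressed in decibels. The sketch below illustrates that calculation; the function name and the toy sequences are assumptions made for the example and do not come from the experiments.

```python
import math


def channel_snr_db(g_att, noise):
    """Channel s/n in dB: energy of the attenuated transmitted
    sequence divided by the energy of the channel noise sequence."""
    if len(g_att) != len(noise):
        raise ValueError("sequences must have the same length W")
    signal_energy = sum(v * v for v in g_att)
    noise_energy = sum(v * v for v in noise)
    if signal_energy <= 0 or noise_energy <= 0:
        raise ValueError("both energies must be positive")
    return 10.0 * math.log10(signal_energy / noise_energy)


# Toy data: noise amplitude one tenth of the signal amplitude.
g_att = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(channel_snr_db(g_att, noise), 1))  # 20.0
```

Because the numerator uses the attenuated transmitted sequence rather than the input speech, the result need not equal the nominal P dB, which is the difference noted in the text.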

We observe from Fig. 8 that for s/n_c of 30 dB, the BER is 0.055
percent, and as previously stated, the average transmitted bit rate
for J = 24 is 1024 b/s. This BER can be reduced by using
error-correcting codes. For example, if a BCH code is employed such
that the number of error-correcting code bits equals the number of
data bits, i.e., the average transmission rate is 512 b/s, the BER
decreases to approximately 10^-5 percent. The extent of the trading of
the reduction in the average transmitted bit rate for improvements in
BER depends on system requirements. In digital radio transmission,
outage occurs when the BER exceeds 0.01 percent.
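
For reference, the percentages quoted above convert to error probabilities as follows, and the halved bit rate follows from a code in which the check bits equal the data bits in number (a rate-1/2 code). This only restates the figures given in the text.

```latex
\begin{align*}
0.055\ \text{percent} &= 5.5 \times 10^{-4}, \\
10^{-5}\ \text{percent} &= 10^{-7}, \\
0.01\ \text{percent} &= 10^{-4}, \\
\tfrac{1}{2} \times 1024\ \text{b/s} &= 512\ \text{b/s}.
\end{align*}
```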

The variation of the segmental s/n of the recovered speech signal
against s/n_c is shown in Fig. 9 for three values of J, with μ_v = 250
and μ_uv = 750. Only blocks satisfying Inequalities (2) and (3) were
used in this calculation of segmental s/n. As s/n_c approaches 50 dB
we approach the ideal channel condition, and by comparing the s/n
values with those in Fig. 5 we may observe the deleterious effect of
the unvoiced speech s/n on the overall s/n. Thus, for J = 16, 24, and
32, the s/n = 25, 23, and 20.5 dB in Fig. 9, whereas when only voiced
speech is present the corresponding s/n = 32, 28, and 26 dB. However,
the perceptual quality of the recovered speech is more suitably
represented by the segmental s/n for voiced speech than by the
combined segmental s/n. Thus, the s/n values in Fig. 9 are lower than
would be anticipated for the quality obtained. If no phase components
had been used for the transmission of data, the variation of s/n with
s/n_c would be the straight line at 45 degrees shown in Fig. 9. The
offset of this line from the origin arises because the s/n of the
speech is calculated as segmental s/n,⁹ whereas s/n_c is computed
according to eq. (26).
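
The sketch below shows one common way to compute a segmental s/n: the s/n is found in decibels for each block and the block values are then averaged. This is an assumption about the form of the measure; the definition actually used is the one in Ref. 9 and may differ in detail. The function name is hypothetical, and the selection of blocks by Inequalities (2) and (3) is not modeled.

```python
import math


def segmental_snr_db(original, recovered, block_len):
    """Average of per-block s/n values (dB) over complete blocks.

    Blocks with zero signal energy or zero error energy are skipped,
    since their s/n in dB is undefined.
    """
    n_blocks = min(len(original), len(recovered)) // block_len
    values = []
    for b in range(n_blocks):
        lo, hi = b * block_len, (b + 1) * block_len
        sig = sum(x * x for x in original[lo:hi])
        err = sum((x - y) ** 2
                  for x, y in zip(original[lo:hi], recovered[lo:hi]))
        if sig > 0 and err > 0:
            values.append(10.0 * math.log10(sig / err))
    if not values:
        raise ValueError("no block with a defined s/n")
    return sum(values) / len(values)


# Toy data: one block at 20 dB and one at 40 dB average to 30 dB.
original = [1.0] * 8
recovered = [0.9] * 4 + [0.99] * 4
print(round(segmental_snr_db(original, recovered, block_len=4), 1))  # 30.0
```

Averaging in decibels gives every block equal weight, so low-s/n blocks such as unvoiced speech pull the combined figure down even when they contribute little energy, which is consistent with the effect described in the text.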
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We conclude this section by providing some waveforms and spectra 
of the input, transmitted, and recovered signals for μ_v = 250,
μ_uv = 750, J = 24, and N = 256. In Fig. 10a twenty blocks of an
arbitrary speech signal are shown, having two blocks of
intraconversational silence. The combined speech and data sequence
{g_i} prior to attenuation is displayed in Fig. 10b, where it can be
observed that the power level of the unvoiced speech is considerably
amplified; high-amplitude, high-frequency components have been
introduced into the voiced segments; and those parts of the silence
that resided in blocks substantially occupied by voiced speech are
carrying data. The recovered speech signal is displayed in Fig. 10c
for the case of an ideal channel. Replacing the data-carrying phase
components by random ones does not cause serious degradation in the
perceptual quality of the recovered speech.

The magnitudes of the spectral components of the waveforms in Figs.
10a and b are shown in Figs. 11a and b, respectively. As data is
carried by the phase components in the speech signal, the magnitude
spectra of the waveforms in Figs. 10a and c are identical. The μ-law
scaling of 24 components for voiced speech is seen to substantially
enhance its high-frequency components, whereas all 98 components
across the speech band are scaled for the unvoiced speech. The μ-law
scaling for voiced speech is reminiscent of frequency pre-emphasis.

V. DISCUSSION 

A system has been proposed for the simultaneous transmission of 
speech and data on the phase of the speech signals, where the band- 
width of the transmitted signal is contained relative to that of the 
original speech. We knew at the outset that if data was to be conveyed 
on the phase of speech signals, the receiver would be forced to 
introduce phase components to replace those that had been discarded 
at the transmitter in favor of data. We postulated that if the introduced 
phase components were derived from a random number source, and 
that their values were confined between ±π, then the perceptual
degradation in speech quality might be acceptable. Our decision to 
randomize the values of the introduced phase components at the 
receiver was based on the knowledge that the variations in the values 
of the phase spectral components in speech, particularly unvoiced 
speech, tend to have random behavior. Further, the effect of phase 
distortion on monaural speech intelligibility is known to be small, the 
controlling factor being the amplitude spectra. Accordingly, we did 
experiments, and from informal listening experiences concluded that 
the randomization of all the phase components of unvoiced speech did 
not cause serious perceptual degradation. In the case of voiced speech 
we discovered that if too many phase components were randomized
the recovered speech quality was poor, and out of the 98 phase
components available in our experiments we concluded that the
maximum number J of phase components that could be randomly
perturbed was 32. The position of the J components had different
perceptual effects, and we decided to make the J components span the
range of the highest inband frequency components, although the actual
position of the J components is far less important than their number.

Fig. 11. Magnitude spectra. (a) The signal in Figs. 10a and c. (b) The
signal in Fig. 10b.
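
The phase operations discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the block length, the bin indices standing in for the J data-carrying components, and all function names are hypothetical; the μ-law magnitude scaling, the attenuator, and the voiced/unvoiced decision are omitted; and a direct O(N²) DFT is used only to keep the example self-contained. The bit mapping (+π/2 for a logical 0, -π/2 for a logical 1) follows the summary of the paper.

```python
import cmath
import math
import random


def dft(x):
    """Direct DFT of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spec):
    """Inverse DFT, returning the real part (spectrum assumed symmetric)."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def set_phases(spec, bins, phases):
    """Copy of spec with new phases at `bins` and unchanged magnitudes.
    Each mirror bin n-k is set to the conjugate so the block stays real.
    Bins must satisfy 0 < k < n/2."""
    n = len(spec)
    out = list(spec)
    for k, ph in zip(bins, phases):
        out[k] = abs(spec[k]) * cmath.exp(1j * ph)
        out[n - k] = out[k].conjugate()
    return out


def embed_bits(block, bins, bits):
    """Transmitter: replace the phases at `bins` by +pi/2 (0) or -pi/2 (1)."""
    phases = [math.pi / 2 if b == 0 else -math.pi / 2 for b in bits]
    return idft_real(set_phases(dft(block), bins, phases))


def extract_bits(block, bins):
    """Receiver: decide each bit from the sign of the received phase."""
    spec = dft(block)
    return [0 if cmath.phase(spec[k]) > 0 else 1 for k in bins]


def randomize_phases(block, bins, rng):
    """Receiver: replace data-carrying phases by random values in [-pi, pi]."""
    phases = [rng.uniform(-math.pi, math.pi) for _ in bins]
    return idft_real(set_phases(dft(block), bins, phases))


# Demonstration on a short synthetic block over an ideal channel.
rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(32)]   # stand-in for speech
bins = list(range(8, 14))                          # hypothetical J = 6 bins
bits = [0, 1, 1, 0, 1, 0]
tx = embed_bits(block, bins, bits)
rx = randomize_phases(tx, bins, rng)
print(extract_bits(tx, bins) == bits)              # True
```

Only phases are altered, so the magnitude spectrum of `rx` matches that of `block` to within rounding error, which mirrors the observation that the magnitude spectra of the input and recovered signals are identical.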

Deeming that randomization of the phase components as described 
was perceptually tolerable, particularly in the presence of channel 
noise when the channel s/n was approximately 30 dB, we decided to 
trade the loss in perceptual quality for the implantation of data into 
those components we had randomized. By this strategy, and for a 
channel s/n of 30 dB we have been able to achieve an average data 
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rate of 1 Kb/s on the assumption that one bit is assigned to each data- 
carrying phase component. To achieve this data rate we are required 
to tolerate a BER of 0.055 percent and an average s/n for voiced and
unvoiced speech of 24 and -3 dB, respectively, the measurements
being made over four sentences of speech. Observe that the BER can
be reduced to a value acceptable for the user by applying channel- 
coding strategies that result in a reduction in the transmitted bit rate. 
An example of such a trade-off is given in Section IV. The recovered 
speech is below toll quality, but the ability to transmit data may make 
this quality reduction acceptable in certain situations. 

Making comparisons of this technique of conveying data on the
phase of the speech signal with those employing scrambling methods¹,²
is difficult because of the radically different approaches of these
schemes. Embedding data in speech by scrambling can be made to
have a very small bandwidth expansion by suitable choice of
scrambling code and block size.² Increasing the complexity of the
scrambling algorithm and the number of bits per block of speech
scrambled alters the system's performance in a way that is difficult
to predict.

The previously described system using scrambling techniques¹,² and
the one described here have only been evaluated for noisy channels.
Which system would perform better in an actual communications
network, and what the requirements would be on channel equalization
and synchronization, are unknown quantities. What we can say is that
errors in the samples at the receiver attributable to noise or imperfect 
channel equalization, the presence of an unwanted sample, and the 
loss of a wanted one owing to incorrect synchronization, are smeared 
over the spectral components by the DFT. The data here is binary 
and therefore considerable noise on the data-carrying phase can be 
tolerated. By using error detection and correction coding the data rate 
can be sacrificed to a value commensurate with a specified BER for a 
given set of channel impairments. 

However, our quest was not to investigate the numerous channel 
conditions. It was to determine if speech and data could be transmitted 
over a noisy channel by embedding the data in the phase of speech, 
and further, if the transmitted bit rate could be sufficiently high to be 
useful, the BER acceptably low, and the degradation in the recovered 
speech quality perceptible but not annoying. Our conclusion is affirm- 
ative. 
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